Abstract

A new method of bandwidth selection for kernel density estimators is 
proposed. The method, termed indirect cross-validation, or ICV, makes use 
of so-called selection kernels. Least squares cross-validation (LSCV) is used 
to select the bandwidth of a selection-kernel estimator, and this bandwidth is 
appropriately rescaled for use in a Gaussian kernel estimator. The proposed 
selection kernels are linear combinations of two Gaussian kernels, and need not 
be unimodal or positive. Theory is developed showing that the relative error
of ICV bandwidths can converge to 0 at a rate of $n^{-1/4}$, which is substantially
better than the $n^{-1/10}$ rate of LSCV. Interestingly, the selection kernels that
are best for purposes of bandwidth selection are very poor if used to actually 
estimate the density function. This property appears to be part of the larger 
and well-documented paradox to the effect that "the harder the estimation 
problem, the better cross-validation performs." The ICV method uniformly 
outperforms LSCV in a simulation study, a real data example, and a simulated 
example in which bandwidths are chosen locally. 

KEYWORDS: Kernel density estimation; Bandwidth selection; Cross-validation;
Local cross-validation.

1 Introduction



Let $X_1, \ldots, X_n$ be a random sample from an unknown density $f$. A kernel density
estimator of $f(x)$ is

$$\hat f_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right), \tag{1}$$

where $h > 0$ is a smoothing parameter, also known as the bandwidth, and $K$ is the
kernel, which is generally chosen to be a unimodal probability density function that
is symmetric about zero and has finite variance. A popular choice for $K$ is the
Gaussian kernel: $\phi(u) = (2\pi)^{-1/2} \exp(-u^2/2)$. To distinguish between estimators with
different kernels, we shall refer to estimator (1) with given kernel $K$ as a $K$-kernel
estimator. Choosing an appropriate bandwidth is vital for the good performance
of a kernel estimate. This paper is concerned with a new method of data-driven
bandwidth selection that we call indirect cross-validation (ICV).
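
As a concrete illustration of estimator (1), the following minimal sketch evaluates a Gaussian kernel estimate on a grid. The function names, the simulated sample, and the bandwidth value 0.4 are illustrative assumptions and are not part of the method described in this paper.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density phi(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, data, h, kernel=gaussian_kernel):
    """Evaluate the kernel density estimate (1) at the points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    # One row per evaluation point, one column per observation.
    u = (x[:, None] - data[None, :]) / h
    return kernel(u).sum(axis=1) / (data.size * h)

# Hypothetical example: 200 standard normal draws and an arbitrary bandwidth.
rng = np.random.default_rng(0)
sample = rng.normal(size=200)
grid = np.linspace(-8.0, 8.0, 1601)
estimate = kde(grid, sample, h=0.4)

# Riemann-sum check that the estimate integrates to about 1.
total_mass = estimate.sum() * (grid[1] - grid[0])
```

Because the Gaussian kernel is itself a density, the estimate is nonnegative and integrates to one for any $h > 0$; the quality of the estimate depends almost entirely on the choice of $h$.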

Many data-driven methods of bandwidth selection have been proposed. The
two most widely used are least squares cross-validation, proposed independently
by Rudemo (1982) and Bowman (1984), and the Sheather and Jones (1991) plug-in
method. Plug-in produces more stable bandwidths than does cross-validation, and
hence is currently the more popular method. Nonetheless, an argument can be made
for cross-validation since it requires fewer assumptions than plug-in and works well
when the density is difficult to estimate; see Loader (1999). A survey of bandwidth
selection methods is given by Jones, Marron, and Sheather (1996).

A number of modifications of LSCV have been proposed in an attempt to improve
its performance. These include the biased cross-validation method of Scott and Terrell (1987),
a method of Chiu (1991a), the trimmed cross-validation of Feluch and Koronacki (1992),
the modified cross-validation of Stute (1992), and the method of Ahmad and Ran (2004)
based on kernel contrasts. The ICV method is similar in spirit to one-sided
cross-validation (OSCV), which is another modification of cross-validation proposed in
the regression context by Hart and Yi (1998). As in OSCV, ICV initially chooses
the bandwidth of an $L$-kernel estimator using least squares cross-validation.
Multiplying the bandwidth chosen at this initial stage by a known constant results in a
bandwidth, call it $\hat h_{ICV}$, that is appropriate for use in a Gaussian kernel estimator.

A popular means of judging a kernel estimator is the mean integrated squared
error, i.e., $\mathrm{MISE}(h) = E[\mathrm{ISE}(h)]$, where

$$\mathrm{ISE}(h) = \int_{-\infty}^{\infty} \left(\hat f_h(x) - f(x)\right)^2 dx.$$

Letting $h_0$ be the bandwidth that minimizes $\mathrm{MISE}(h)$ when the kernel is Gaussian,
we will show that the mean squared error of $\hat h_{ICV}$ as an estimator of $h_0$ converges
to 0 at a faster rate than that of the ordinary LSCV bandwidth. We also describe
an unexpected bonus associated with ICV, namely that, unlike LSCV, it is robust
to rounded data. A fairly extensive simulation study and two data analyses confirm
that ICV performs better than ordinary cross-validation in finite samples.

2 Description of indirect cross-validation 

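We begin with some notation and definitions that will be used subsequently. For an
arbitrary function $g$, define

$$R(g) = \int g(u)^2\,du, \qquad \mu_{jg} = \int u^j g(u)\,du.$$

The LSCV criterion is given by

$$\mathrm{LSCV}(h) = R(\hat f_h) - \frac{2}{n} \sum_{i=1}^{n} \hat f_{h,-i}(X_i),$$

where, for $i = 1, \ldots, n$, $\hat f_{h,-i}$ denotes a kernel estimator using all the original
observations except for $X_i$. When $\hat f_h$ uses kernel $K$, LSCV can be written as

$$\mathrm{LSCV}(h) = \frac{1}{nh} R(K) + \frac{1}{n^2 h} \sum_{i \ne j} \int K(t)\, K\left(t + \frac{X_i - X_j}{h}\right) dt - \frac{2}{n(n-1)h} \sum_{i \ne j} K\left(\frac{X_i - X_j}{h}\right). \tag{2}$$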
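
For the Gaussian kernel, the integral in (2) has a closed form, since the convolution of two standard normal densities is the $N(0, 2)$ density. The sketch below uses that fact to evaluate the criterion and minimize it over a grid; the function name, the simulated sample, and the bandwidth grid are assumptions made for illustration only.

```python
import numpy as np

def lscv_gaussian(h, data):
    """LSCV criterion (2) for the Gaussian kernel, using closed-form convolutions."""
    data = np.asarray(data, dtype=float)
    n = data.size
    d = (data[:, None] - data[None, :]) / h
    # (phi * phi)(d) is the N(0, 2) density; the i = j terms give R(phi) / (n h).
    conv = np.exp(-0.25 * d**2) / (2.0 * np.sqrt(np.pi))
    phi = np.exp(-0.5 * d**2) / np.sqrt(2.0 * np.pi)
    integral_term = conv.sum() / (n**2 * h)
    # Leave-one-out term: sum over i != j only.
    off_diagonal = phi.sum() - np.trace(phi)
    loo_term = 2.0 * off_diagonal / (n * (n - 1) * h)
    return integral_term - loo_term

# Hypothetical example: grid search for the LSCV bandwidth.
rng = np.random.default_rng(1)
sample = rng.normal(size=100)
hs = np.linspace(0.05, 1.5, 146)
scores = np.array([lscv_gaussian(h, sample) for h in hs])
h_ucv = hs[np.argmin(scores)]
```

A grid search is used here only for transparency; any one-dimensional optimizer could be substituted, bearing in mind that the LSCV curve may have several local minima.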



It is well known that $\mathrm{LSCV}(h)$ is an unbiased estimator of $\mathrm{MISE}(h) - \int f^2(x)\,dx$,
and hence the minimizer of $\mathrm{LSCV}(h)$ with respect to $h$ is denoted $\hat h_{UCV}$.

2.1 The basic method

Our aim is to choose the bandwidth of a second order kernel estimator. A second
order kernel integrates to 1, has first moment 0, and finite, nonzero second moment.
In principle our method can be used to choose the bandwidth of any second order
kernel estimator, but in this article we restrict attention to $K = \phi$, the Gaussian
kernel. It is well known that a $\phi$-kernel estimator has asymptotic mean integrated
squared error (MISE) within 5% of the minimum among all positive, second order
kernel estimators.

Indirect cross-validation may be described as follows:

• Select the bandwidth of an $L$-kernel estimator using least squares
cross-validation, and call this bandwidth $\hat b_{UCV}$. The kernel $L$ is a second order kernel
that is a linear combination of two Gaussian kernels, and will be discussed in
detail in Section 2.2.

• Assuming that the underlying density $f$ has second derivative which is
continuous and square integrable, the bandwidths $h_n$ and $b_n$ that asymptotically
minimize the MISE of $\phi$- and $L$-kernel estimators, respectively, are related
as follows:

$$h_n = \left(\frac{R(\phi)\,\mu_{2L}^2}{R(L)\,\mu_{2\phi}^2}\right)^{1/5} b_n = C b_n. \tag{3}$$

• Define the indirect cross-validation bandwidth by $\hat h_{ICV} = C \hat b_{UCV}$.
Importantly, the constant $C$ depends on no unknown parameters. Expression (3)
and existing cross-validation theory suggest that $\hat h_{ICV}/h_0$ will at least
converge to 1 in probability, where $h_0$ is the minimizer of MISE for the $\phi$-kernel
estimator.
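
The three steps above can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: it assumes the selection kernel is the two-Gaussian combination $(1 + \alpha)\phi(u) - (\alpha/\sigma)\phi(u/\sigma)$ discussed in Section 2.2, uses closed-form Gaussian convolutions for $R(L)$ and the LSCV integral, and minimizes by grid search. All function names, the parameter values $\alpha = \sigma = 6$, and the grid are hypothetical choices.

```python
import numpy as np

def normal_pdf(u, sd=1.0):
    """Density of N(0, sd^2) evaluated at u."""
    return np.exp(-0.5 * (u / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def sel_kernel(u, alpha, sigma):
    """Assumed selection kernel: (1 + alpha) phi(u) - (alpha / sigma) phi(u / sigma)."""
    return (1.0 + alpha) * normal_pdf(u) - alpha * normal_pdf(u, sigma)

def sel_kernel_conv(u, alpha, sigma):
    """Convolution L*L: a combination of N(0, 2), N(0, 1 + sigma^2), N(0, 2 sigma^2)."""
    return ((1.0 + alpha) ** 2 * normal_pdf(u, np.sqrt(2.0))
            - 2.0 * alpha * (1.0 + alpha) * normal_pdf(u, np.sqrt(1.0 + sigma**2))
            + alpha**2 * normal_pdf(u, np.sqrt(2.0) * sigma))

def rescaling_constant(alpha, sigma):
    """Constant C of (3), using mu_{2 phi} = 1 and R(phi) = 1 / (2 sqrt(pi))."""
    r_phi = 1.0 / (2.0 * np.sqrt(np.pi))
    r_l = float(sel_kernel_conv(0.0, alpha, sigma))   # R(L) = (L*L)(0)
    mu2_l = 1.0 + alpha - alpha * sigma**2            # second moment of L
    return (r_phi * mu2_l**2 / r_l) ** 0.2

def lscv_sel(b, data, alpha, sigma):
    """LSCV criterion for an L-kernel estimator with bandwidth b."""
    data = np.asarray(data, dtype=float)
    n = data.size
    d = (data[:, None] - data[None, :]) / b
    conv = sel_kernel_conv(d, alpha, sigma)
    ker = sel_kernel(d, alpha, sigma)
    off_diagonal = ker.sum() - np.trace(ker)
    return conv.sum() / (n**2 * b) - 2.0 * off_diagonal / (n * (n - 1) * b)

def icv_bandwidth(data, alpha, sigma, b_grid):
    """Step 1: minimize LSCV over b_grid; steps 2-3: rescale by C."""
    scores = [lscv_sel(b, data, alpha, sigma) for b in b_grid]
    b_ucv = b_grid[int(np.argmin(scores))]
    return rescaling_constant(alpha, sigma) * b_ucv

# Hypothetical example with alpha = sigma = 6.
rng = np.random.default_rng(2)
sample = rng.normal(size=200)
b_grid = np.linspace(0.02, 0.5, 97)
h_icv = icv_bandwidth(sample, 6.0, 6.0, b_grid)
```

Note that $C$ can be much larger than 1 (about 4.15 for $\alpha = \sigma = 6$), so the search grid for $b$ should cover considerably smaller values than one would use for a Gaussian-kernel bandwidth.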

Henceforth, we let $\hat h_{UCV}$ denote the bandwidth that minimizes $\mathrm{LSCV}(h)$ with
$K = \phi$. Theory of Hall and Marron (1987) and Scott and Terrell (1987) shows that
the relative error $(\hat h_{UCV} - h_0)/h_0$ converges to 0 at the rather disappointing rate of
$n^{-1/10}$. In contrast, we will show that $(\hat h_{ICV} - h_0)/h_0$ can converge to 0 at the rate
$n^{-1/4}$. Kernels $L$ that are sufficient for this result are discussed next.



2.2 Selection kernels 

We consider the family of kernels $\mathcal{L} = \{L(\cdot\,; \alpha, \sigma) : \alpha \ge 0, \sigma > 0\}$, where, for all $u$,

$$L(u; \alpha, \sigma) = (1 + \alpha)\phi(u) - \frac{\alpha}{\sigma}\phi\left(\frac{u}{\sigma}\right). \tag{4}$$

Note that the Gaussian kernel is a special case of (4) when $\alpha = 0$ or $\sigma = 1$. Each
member of $\mathcal{L}$ is symmetric about 0 and such that $\mu_{2L} = \int u^2 L(u)\,du = 1 + \alpha - \alpha\sigma^2$.
It follows that kernels in $\mathcal{L}$ are second order, with the exception of those for which
$\sigma = \sqrt{(1 + \alpha)/\alpha}$.

The family $\mathcal{L}$ can be partitioned into three families: $\mathcal{L}_1$, $\mathcal{L}_2$ and $\mathcal{L}_3$. The first of
these is $\mathcal{L}_1 = \{L(\cdot\,; \alpha, \sigma) : \alpha > 0, \sigma < \alpha/(1 + \alpha)\}$. Each kernel in $\mathcal{L}_1$ has a negative dip
centered at $x = 0$. For $\alpha$ fixed, the smaller $\sigma$ is, the more extreme the dip; and for
fixed $\sigma$, the larger $\alpha$ is, the more extreme the dip. The kernels in $\mathcal{L}_1$ are ones that
"cut-out-the-middle."

The second family is $\mathcal{L}_2 = \{L(\cdot\,; \alpha, \sigma) : \alpha > 0, \alpha/(1 + \alpha) \le \sigma \le 1\}$. Kernels in $\mathcal{L}_2$
are densities which can be unimodal or bimodal. Note that the Gaussian kernel is
a member of this family. The third sub-family is $\mathcal{L}_3 = \{L(\cdot\,; \alpha, \sigma) : \alpha > 0, \sigma > 1\}$,
each member of which has negative tails. Examples of kernels in $\mathcal{L}_3$ are shown in
Figure 1.

Figure 1: Selection kernels in $\mathcal{L}_3$. The dotted curve corresponds to the Gaussian
kernel, and each of the other kernels has $\alpha = 6$.
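
The properties of the selection kernels stated above are easy to check numerically. The following sketch assumes the two-Gaussian form $(1 + \alpha)\phi(u) - (\alpha/\sigma)\phi(u/\sigma)$ and the sub-family boundaries $\sigma = \alpha/(1 + \alpha)$ and $\sigma = 1$; the function names and the example values $\alpha = \sigma = 6$ are illustrative assumptions.

```python
import numpy as np

def selection_kernel(u, alpha, sigma):
    """Assumed selection kernel: (1 + alpha) phi(u) - (alpha / sigma) phi(u / sigma)."""
    phi = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)
    return (1.0 + alpha) * phi(u) - (alpha / sigma) * phi(u / sigma)

def family(alpha, sigma):
    """Return 1, 2 or 3 for the sub-families L1, L2, L3 (requires alpha > 0)."""
    if sigma < alpha / (1.0 + alpha):
        return 1          # negative dip at the origin
    return 2 if sigma <= 1.0 else 3   # density, or negative tails

# Numerical check of total mass and second moment for alpha = sigma = 6.
u = np.linspace(-60.0, 60.0, 120001)
du = u[1] - u[0]
L = selection_kernel(u, alpha=6.0, sigma=6.0)
mass = L.sum() * du                     # should be close to 1
second_moment = (u**2 * L).sum() * du   # should be close to 1 + 6 - 6 * 36 = -209
```

The large negative second moment of a negative-tailed kernel is what makes it a poor choice for estimating $f$ directly, even though it integrates to one.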

Kernels in $\mathcal{L}_1$ and $\mathcal{L}_3$ are not of the type usually used for estimating $f$.
Nonetheless, a worthwhile question is "why not use $L$ for both cross-validation and estimation
of $f$?" One could then bypass the step of rescaling $\hat b_{UCV}$ and simply estimate $f$ by an
$L$-kernel estimator with bandwidth $\hat b_{UCV}$. The ironic answer to this question is that
the kernels in $\mathcal{L}$ that are best for cross-validation purposes are very inefficient for
estimating $f$. Indeed, it turns out that an $L$-kernel estimator based on a sequence
of ICV-optimal kernels has MISE that does not converge to 0 faster than $n^{-1/2}$.
In contrast, the MISE of the best $\phi$-kernel estimator tends to 0 like $n^{-4/5}$. These
facts fit with other cross-validation paradoxes, which include the fact that LSCV
outperforms other methods when the density is highly structured (see Loader (1999)),
the improved performance of cross-validation in multivariate density estimation
(see Sain, Baggerly, and Scott (1994)), and its improvement when the true density is not
smooth (see van Es (1992)). One could paraphrase these phenomena as follows: "The
more difficult the function is to estimate, the better cross-validation seems to
perform." In our work, we have in essence made the function more difficult to estimate
by using an inefficient kernel $L$. More details on the MISE of $L$-kernel estimators
may be found in Savchuk (2009).



3 Large sample theory 

The theory presented in this section provides the underpinning for our methodology.
We first state a theorem on the asymptotic distribution of $\hat h_{ICV}$, and then derive
asymptotically optimal choices for the parameters $\alpha$ and $\sigma$ of the selection kernel.

3.1 Asymptotic mean squared error of the ICV bandwidth 



Classical theory of Hall and Marron (1987) and Scott and Terrell (1987) entails that
the bias of an LSCV bandwidth is asymptotically negligible in comparison to its
standard deviation. We will show that the variance of an ICV bandwidth can
converge to 0 at a faster rate than that of an LSCV bandwidth. This comes at the
expense of a squared bias that is not negligible. However, we will show how to select
$\alpha$ and $\sigma$ (the parameters of the selection kernel) so that the variance and squared
bias are balanced and the resulting mean squared error tends to 0 at a faster rate
than does that of the LSCV bandwidth. The optimal rate of convergence of the
relative error $(\hat h_{ICV} - h_0)/h_0$ is $n^{-1/4}$, a substantial improvement over the infamous
$n^{-1/10}$ rate for LSCV.

Before stating our main result concerning the asymptotic distribution of $\hat h_{ICV}$,
we define some notation:

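$$\gamma(u) = \int L(w) L(w + u)\,dw - 2L(u), \qquad \rho(u) = u\gamma'(u),$$

$$T_n(b) = \sum_{1 \le i < j \le n} \left[\gamma\left(\frac{X_i - X_j}{b}\right) + \rho\left(\frac{X_i - X_j}{b}\right)\right], \qquad T_n^{(j)}(b) = \frac{\partial^j T_n(b)}{\partial b^j}, \quad j = 1, 2,$$

$$A_\alpha = \frac{3}{\sqrt{2\pi}} (1 + \alpha)^2 \left[\frac{(1 + \alpha)^2}{8} - \frac{8}{9\sqrt{3}}(1 + \alpha) + \frac{1}{\sqrt{2}}\right],$$

$$C_\alpha = \frac{\sqrt{2A_\alpha}\,(2\sqrt{\pi})^{9/10}}{5 (1 + \alpha)^{9/5} \alpha^{1/5}} \qquad \text{and} \qquad D_\alpha = \frac{3}{20} \left(\frac{(1 + \alpha)^2}{2\alpha^2\sqrt{\pi}}\right)^{2/5}.$$

Note that to simplify notation, we have suppressed the fact that $L$, $\gamma$ and $\rho$ depend
on the parameters $\alpha$ and $\sigma$. An outline of the proof of the following theorem is given
in the Appendix.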

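Theorem. Assume that $f$ and its first five derivatives are continuous and bounded
and that $f^{(6)}$ exists and is Lipschitz continuous. Suppose also that

$$(\hat b_{UCV} - b_0)\,\frac{T_n^{(2)}(\tilde b)}{T_n^{(1)}(b_0)} = o_p(1) \tag{5}$$

for any sequence of random variables $\tilde b$ such that $|\tilde b - b_0| \le |\hat b_{UCV} - b_0|$, a.s. Then,
if $\sigma = o(n)$ and $\alpha$ is fixed,

$$\frac{\hat h_{ICV} - h_0}{h_0} = Z_n S_n + B_n + o_p(S_n + B_n),$$

as $n \to \infty$ and $\sigma \to \infty$, where $Z_n$ converges in distribution to a standard normal
random variable,

$$S_n = \left(\frac{1}{\sigma^{2/5} n^{1/10}}\right) \frac{R(f)^{1/2}}{R(f'')^{1/10}}\, C_\alpha, \tag{6}$$

and

$$B_n = \left(\frac{\sigma}{n}\right)^{2/5} \frac{R(f''')}{R(f'')^{7/5}}\, D_\alpha. \tag{7}$$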



Remarks 

R1. Assumption (5) is only slightly stronger than assuming that $\hat b_{UCV}/b_0$ converges
in probability to 1. To avoid making our paper overly technical we have chosen
not to investigate sufficient conditions for (5). However, this can be done using
techniques as in Hall (1983) and Hall and Marron (1987).

R2. Theorem 4.1 of Scott and Terrell (1987) on asymptotic normality of LSCV
bandwidths is not immediately applicable to our setting for at least three
reasons: the kernel $L$ is not positive, it does not have compact support, and,
most importantly, it changes with $n$ via the parameter $\sigma$.

R3. The assumption of six derivatives for $f$ is required for a precise quantification
of the asymptotic bias of $\hat h_{ICV}$. Our proof of asymptotic normality of $\hat b_{UCV}$
only requires that $f$ be four times differentiable, which coincides with the
conditions of Theorem 4.1 in Scott and Terrell (1987).

R4. The asymptotic bias $B_n$ is positive, implying that the ICV bandwidth tends to
be larger than the optimal bandwidth. This is consistent with our experience
in numerous simulations.

In the next section we apply the results of our theorem to determine asymptotically
optimal choices for $\alpha$ and $\sigma$.

3.2 Minimizing asymptotic mean squared error

The limiting distribution of $(\hat h_{ICV} - h_0)/h_0$ has second moment $S_n^2 + B_n^2$, where $S_n$
and $B_n$ are defined by (6) and (7). Minimizing this expression with respect to $\sigma$
yields the following asymptotically optimal choice for $\sigma$:

$$\sigma_{n,opt} = n^{3/8} \left(\frac{C_\alpha}{D_\alpha}\right)^{5/4} \left[\frac{R(f)\,R(f'')^{13/5}}{R(f''')^2}\right]^{5/8}. \tag{8}$$

The corresponding asymptotically optimal mean squared error is

$$\mathrm{MSE}_{n,opt} = 2 n^{-1/2}\, C_\alpha D_\alpha \left[\frac{R(f)\,R(f''')^2}{R(f'')^3}\right]^{1/2}, \tag{9}$$

which confirms our previous claim that the relative error of $\hat h_{ICV}$ converges to 0 at
the rate $n^{-1/4}$. The corresponding rates for LSCV and the Sheather-Jones plug-in
rule are $n^{-1/10}$ and $n^{-5/14}$, respectively.
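
The minimization behind (8) and (9) can be written out in a few lines. The sketch below assumes that (6) and (7) read $S_n = \sigma^{-2/5} n^{-1/10} R(f)^{1/2} R(f'')^{-1/10} C_\alpha$ and $B_n = (\sigma/n)^{2/5} R(f''') R(f'')^{-7/5} D_\alpha$; the shorthand $a$ and $b$ is introduced here purely for illustration.

```latex
\begin{align*}
S_n^2 + B_n^2 &= a\,\sigma^{-4/5} n^{-1/5} + b\,\sigma^{4/5} n^{-4/5},
  \qquad a = \frac{C_\alpha^2 R(f)}{R(f'')^{1/5}}, \quad
  b = \frac{D_\alpha^2 R(f''')^2}{R(f'')^{14/5}}, \\
0 &= \frac{\partial}{\partial \sigma}\left(S_n^2 + B_n^2\right)
  \;\Longrightarrow\; \sigma^{8/5} = \frac{a}{b}\, n^{3/5}
  \;\Longrightarrow\; \sigma_{n,opt} = n^{3/8} \left(\frac{a}{b}\right)^{5/8}, \\
S_n^2 = B_n^2 &= \sqrt{ab}\; n^{-1/2} \quad \text{at } \sigma = \sigma_{n,opt},
  \qquad \sqrt{ab} = C_\alpha D_\alpha
  \left[\frac{R(f)\, R(f''')^2}{R(f'')^3}\right]^{1/2}.
\end{align*}
```

At the optimum the variance and squared bias are equal, each of order $n^{-1/2}$, so the relative error itself is of order $n^{-1/4}$.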

Because $\alpha$ is not confounded with $f$ in $\mathrm{MSE}_{n,opt}$, we may determine a single
optimal value of $\alpha$ that is independent of $f$. The function $C_\alpha D_\alpha$ of $\alpha$ is minimized
at $\alpha_0 = 2.4233$. Furthermore, small choices of $\alpha$ lead to an arbitrarily large increase
in mean squared error, while the MSE at $\alpha = \infty$ is only about 1.33 times that at
the minimum.
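
The two numerical claims above can be checked with a short computation. The sketch assumes the constants of Section 3.1 have the forms $A_\alpha = \frac{3}{\sqrt{2\pi}}(1+\alpha)^2[(1+\alpha)^2/8 - 8(1+\alpha)/(9\sqrt{3}) + 1/\sqrt{2}]$, $C_\alpha = \sqrt{2A_\alpha}(2\sqrt{\pi})^{9/10}/(5(1+\alpha)^{9/5}\alpha^{1/5})$ and $D_\alpha = \frac{3}{20}((1+\alpha)^2/(2\alpha^2\sqrt{\pi}))^{2/5}$; the function names and the search grid are illustrative choices.

```python
import numpy as np

def A(alpha):
    """Assumed form of A_alpha from Section 3.1."""
    x = 1.0 + alpha
    bracket = x**2 / 8.0 - 8.0 * x / (9.0 * np.sqrt(3.0)) + 1.0 / np.sqrt(2.0)
    return 3.0 / np.sqrt(2.0 * np.pi) * x**2 * bracket

def C(alpha):
    """Assumed form of C_alpha (variance constant)."""
    return (np.sqrt(2.0 * A(alpha)) * (2.0 * np.sqrt(np.pi)) ** 0.9
            / (5.0 * (1.0 + alpha) ** 1.8 * alpha**0.2))

def D(alpha):
    """Assumed form of D_alpha (bias constant)."""
    return 0.15 * ((1.0 + alpha) ** 2 / (2.0 * alpha**2 * np.sqrt(np.pi))) ** 0.4

# Grid search with step 1e-4 for the minimizer of C_alpha * D_alpha.
alphas = np.linspace(0.5, 20.0, 195001)
cd = C(alphas) * D(alphas)
alpha_0 = alphas[np.argmin(cd)]

# Ratio of the limiting value (alpha -> infinity) to the minimum.
ratio_inf = C(1e8) * D(1e8) / cd.min()
```

Under these assumed forms, the product $C_\alpha D_\alpha$ is proportional to $\sqrt{g(\alpha)}/\alpha$, where $g$ is the bracketed quadratic in $A_\alpha$, which explains why it blows up as $\alpha \to 0$ but levels off as $\alpha \to \infty$.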

Our theory to this point applies to kernels in $\mathcal{L}_3$, i.e., kernels with negative
tails. Savchuk (2009) has developed similar theory for the case where $\sigma \to 0$, which
corresponds to $L \in \mathcal{L}_1$, i.e., kernels that apply negative weights to the smallest
spacings in the LSCV criterion. Interestingly, the same optimal rate of $n^{-1/4}$ results
from letting $\sigma \to 0$. However, when the optimal values of $(\alpha, \sigma)$ are used in the
respective cases ($\sigma \to 0$ and $\sigma \to \infty$), the limiting ratio of optimum mean squared
errors is 0.752, with $\sigma \to \infty$ yielding the smaller error. Our simulation studies
confirm that using $L$ with large $\sigma$ does lead to more accurate estimation of the
optimal bandwidth.

4 Practical choice of $\alpha$ and $\sigma$



In order to have an idea of how good choices of $\alpha$ and $\sigma$ vary with $n$ and $f$, we
determined the minimizers of the asymptotic mean squared error of $\hat h_{ICV}$ for various
sample sizes and densities. In doing so, we considered a single expression for the
asymptotic mean squared error that is valid for either large or small values of $\sigma$.
Furthermore, we use a slightly enhanced version of the asymptotic bias of $\hat h_{ICV}$.
The first order bias of $\hat h_{ICV}$ is $Cb_0 - h_0$, or $C(b_0 - b_n) + (h_n - h_0)$, where

$$b_n = \left(\frac{R(L)}{\mu_{2L}^2 R(f'')}\right)^{1/5} n^{-1/5} \qquad \text{and} \qquad h_n = \left(\frac{R(\phi)}{\mu_{2\phi}^2 R(f'')}\right)^{1/5} n^{-1/5}. \tag{10}$$

Now, the term $h_n - h_0$ is of smaller order asymptotically than $C(b_0 - b_n)$ and hence
was deleted in the theory of Section 3. Here we retain $h_n - h_0$, and hence the $\alpha$ that
minimizes the mean squared error depends on both $n$ and $f$.



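We considered the following five normal mixtures defined in the article by Marron and Wand (1992):

Gaussian density: $N(0, 1)$

Skewed unimodal density: $\frac{1}{5}N(0, 1) + \frac{1}{5}N\left(\frac{1}{2}, \left(\frac{2}{3}\right)^2\right) + \frac{3}{5}N\left(\frac{13}{12}, \left(\frac{5}{9}\right)^2\right)$

Bimodal density: $\frac{1}{2}N\left(-1, \left(\frac{2}{3}\right)^2\right) + \frac{1}{2}N\left(1, \left(\frac{2}{3}\right)^2\right)$

Separated bimodal density: $\frac{1}{2}N\left(-\frac{3}{2}, \left(\frac{1}{2}\right)^2\right) + \frac{1}{2}N\left(\frac{3}{2}, \left(\frac{1}{2}\right)^2\right)$

Skewed bimodal density: $\frac{3}{4}N(0, 1) + \frac{1}{4}N\left(\frac{3}{2}, \left(\frac{1}{3}\right)^2\right)$.

These choices for $f$ provide a fairly representative range of density shapes. It is
worth noting that the asymptotically optimal $\sigma$ (expression (8)) is free of location
and scale. We may thus choose a single representative of a location-scale family
when investigating the effect of $f$. The following remarks summarize our findings
about $\alpha$ and $\sigma$.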

• For each $n$, the optimal value of $\sigma$ ($\alpha$) is larger (smaller) for the unimodal
densities than for the bimodal ones.

• All of the MSE-optimal $\alpha$ and $\sigma$ correspond to kernels from $\mathcal{L}_3$, the family of
negative-tailed kernels.

• For each density, the optimal $\alpha$ decreases monotonically with $n$. Recall from
Section 3.2 that the asymptotically optimal $\alpha$ is 2.42. For each unimodal
density, the optimal $\alpha$ is within 13.5% of 2.42 at $n = 1000$, and for each
bimodal density is within 18% of 2.42 when $n$ is 20,000.

In practice it would be desirable to have choices of $\alpha$ and $\sigma$ that would adapt to
the $n$ and $f$ at hand. However, attempting to estimate optimal values of $\alpha$ and $\sigma$
is potentially as difficult as the bandwidth selection problem itself. We have built
a practical purpose model for $\alpha$ and $\sigma$ by using polynomial regression. The
independent variable was $\log_{10}(n)$ and the dependent variables were the MSE-optimal
values of $\log_{10}(\alpha)$ and $\log_{10}(\sigma)$ for the five densities defined above. Using a sixth
degree polynomial for $\alpha$ and a quadratic for $\sigma$, we arrived at the following models
for $\alpha$ and $\sigma$:

$$\begin{aligned}
\alpha_{mod} &= 10^{\,3.390 - 1.093 \log_{10}(n) + 0.025 \log_{10}(n)^3 - 0.00004 \log_{10}(n)^6}, \\
\sigma_{mod} &= 10^{\,-0.58 + 0.386 \log_{10}(n) - 0.012 \log_{10}(n)^2}, \qquad 100 \le n \le 500000.
\end{aligned} \tag{11}$$
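
Model (11) is straightforward to code. The sketch below assumes the coefficients as read from the display above (3.390, 1.093, 0.025, 0.00004 for $\alpha$ and 0.58, 0.386, 0.012 for $\sigma$) and the stated range of sample sizes; the function name is a hypothetical choice.

```python
import math

def model_alpha_sigma(n):
    """Return (alpha_mod, sigma_mod) from model (11) for sample size n."""
    if not 100 <= n <= 500000:
        raise ValueError("model (11) is stated for 100 <= n <= 500000")
    t = math.log10(n)
    alpha = 10.0 ** (3.390 - 1.093 * t + 0.025 * t**3 - 0.00004 * t**6)
    sigma = 10.0 ** (-0.58 + 0.386 * t - 0.012 * t**2)
    return alpha, sigma

# Example: parameters for a few sample sizes.
params = {n: model_alpha_sigma(n) for n in (100, 1000, 100000)}
```

With these coefficients, $\alpha_{mod}$ decreases and $\sigma_{mod}$ increases as $n$ grows, and $\sigma_{mod} > 1$ throughout the range, so the model always selects a negative-tailed kernel.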

To the extent that unimodal densities are more prevalent than multimodal
densities in practice, these model values are biased towards bimodal cases. Our extensive
experience shows that the penalty for using good bimodal choices for $\alpha$ and $\sigma$ when
in fact the density is unimodal is an increase in the upward bias of $\hat h_{ICV}$. Our
implementation of ICV, however, guards against oversmoothing by using an objective
upper bound on the bandwidth, as we explain in detail in Section 7. We thus feel
confident in recommending model (11) for choosing $\alpha$ and $\sigma$ in practice, at least
until a better method is proposed. Indeed, this model is what we used to choose $\alpha$
and $\sigma$ in the simulation study reported upon in Section 7.

5 Robustness of ICV to data rounding



Silverman (1986, p. 52) showed that if the data are rounded to such an extent that
the number of pairs $i < j$ for which $X_i = X_j$ is above a threshold, then $\mathrm{LSCV}(h)$
approaches $-\infty$ as $h$ approaches zero. This threshold is $0.27n$ for the Gaussian
kernel. Chiu (1991b) showed that for data with ties, the behavior of $\mathrm{LSCV}(h)$
as $h \to 0$ is determined by the balance between $R(K)$ and $2K(0)$. In particular,
$\lim_{h \to 0} \mathrm{LSCV}(h)$ is $-\infty$ and $\infty$ when $R(K) < 2K(0)$ and $R(K) > 2K(0)$,
respectively. The former condition holds necessarily if $K$ is nonnegative and has its
maximum at 0. This means that all the traditional kernels have the problem of
choosing $h = 0$ when the data are rounded.

Recall that selection kernels (4) are not restricted to be nonnegative. It turns
out that there exist $\alpha$ and $\sigma$ such that $R(L) > 2L(0)$ will hold. We say that selection
kernels satisfying this condition are robust to rounding. It can be verified that the
negative-tailed selection kernels with $\sigma > 1$ are robust to rounding when

$$\alpha > \frac{-a_\sigma + \sqrt{a_\sigma^2 + (2 - 1/\sqrt{2})\, b_\sigma}}{b_\sigma}, \tag{12}$$

where $a_\sigma = \left(\frac{1}{\sqrt{2}} - \frac{1}{\sqrt{1 + \sigma^2}} - 1 + \frac{1}{\sigma}\right)$ and $b_\sigma = \left(\frac{1}{\sqrt{2}} - \frac{2}{\sqrt{1 + \sigma^2}} + \frac{1}{\sigma\sqrt{2}}\right)$. It turns out
that all the selection kernels corresponding to model (11) are robust to rounding.
Figure 2 shows the region (12) and also the curve defined by model (11) for $100 \le
n \le 500000$. Interestingly, the boundary separating robust from nonrobust kernels
almost coincides with the $(\alpha, \sigma)$ pairs defined by that model.
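
The robustness condition $R(L) > 2L(0)$ can be checked directly. The sketch below assumes the two-Gaussian selection kernel $(1 + \alpha)\phi(u) - (\alpha/\sigma)\phi(u/\sigma)$, for which $R(L)$ and $L(0)$ have closed forms, and it assumes the reading of $a_\sigma$ and $b_\sigma$ in (12) given in the comments; the function names are hypothetical.

```python
import math

def is_robust_to_rounding(alpha, sigma):
    """Check R(L) > 2 L(0) using closed forms for the two-Gaussian kernel."""
    r_l = ((1 + alpha) ** 2 / (2 * math.sqrt(math.pi))
           - 2 * alpha * (1 + alpha) / math.sqrt(2 * math.pi * (1 + sigma**2))
           + alpha**2 / (2 * sigma * math.sqrt(math.pi)))
    l_zero = (1 + alpha - alpha / sigma) / math.sqrt(2 * math.pi)
    return r_l > 2 * l_zero

def alpha_threshold(sigma):
    """Right-hand side of (12) under the assumed a_sigma and b_sigma; for sigma > 1."""
    # a_sigma = 1/sqrt(2) - 1/sqrt(1 + sigma^2) - 1 + 1/sigma
    a = 1 / math.sqrt(2) - 1 / math.sqrt(1 + sigma**2) - 1 + 1 / sigma
    # b_sigma = 1/sqrt(2) - 2/sqrt(1 + sigma^2) + 1/(sigma sqrt(2))
    b = 1 / math.sqrt(2) - 2 / math.sqrt(1 + sigma**2) + 1 / (math.sqrt(2) * sigma)
    return (-a + math.sqrt(a * a + (2 - 1 / math.sqrt(2)) * b)) / b

# Example: the Gaussian kernel (alpha = 0) versus a negative-tailed kernel.
gaussian_robust = is_robust_to_rounding(0.0, 1.0)
negative_tailed_robust = is_robust_to_rounding(6.0, 6.0)
```

The two functions describe the same boundary: for fixed $\sigma > 1$, the direct check switches from false to true as $\alpha$ crosses the threshold.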

6 Local ICV 

A local version of cross-validation for density estimation was proposed and analyzed
independently by Hall and Schucany (1989) and Mielniczuk, Sarda, and Vieu (1989).
A local method allows the bandwidth to vary with $x$, which is desirable when the
smoothness of the underlying density varies sufficiently with $x$. Fan, Hall, Martin, and Patil (1996)
proposed a different method of local smoothing that is a hybrid of plug-in and
cross-validation methods. Here we propose that ICV be performed locally. The method
parallels that of Hall and Schucany (1989) and Mielniczuk, Sarda, and Vieu (1989),
with the main difference being that each local bandwidth is chosen by ICV rather
than LSCV. We suggest using the smallest local minimizer of the ICV curve, since
ICV does not have LSCV's tendency to undersmooth.

Figure 2: Selection kernels robust to rounding have $\alpha$ and $\sigma$ above the solid curve.
The dashed curve corresponds to the model-based selection kernels.

Let $\hat f_b$ be a kernel estimate that employs a kernel in the class $\mathcal{L}$, and define, at
the point $x$, a local ICV curve by

$$\mathrm{ICV}(x, b) = \frac{1}{w} \int_{-\infty}^{\infty} \phi\left(\frac{x - u}{w}\right) \hat f_b^2(u)\,du - \frac{2}{nw} \sum_{i=1}^{n} \phi\left(\frac{x - X_i}{w}\right) \hat f_{b,-i}(X_i), \qquad b > 0.$$

The quantity $w$ determines the degree to which the cross-validation is local, with
a very large choice of $w$ corresponding to global ICV. Let $\hat b(x)$ be the minimizer of
$\mathrm{ICV}(x, b)$ with respect to $b$. Then the bandwidth of a Gaussian kernel estimator
at the point $x$ is taken to be $\hat h(x) = C\hat b(x)$. The constant $C$ is defined by (3), and
choice of $\alpha$ and $\sigma$ in the selection kernel will be discussed in Section 8.
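
A minimal numerical sketch of the local criterion follows. It assumes that the local ICV curve has the Hall and Schucany form with a Gaussian weight, $\frac{1}{w}\int \phi((x-u)/w)\hat f_b^2(u)\,du - \frac{2}{nw}\sum_i \phi((x - X_i)/w)\hat f_{b,-i}(X_i)$, and that the selection kernel is $(1 + \alpha)\phi(u) - (\alpha/\sigma)\phi(u/\sigma)$. The integral is approximated by a Riemann sum, and all function names and arguments are hypothetical.

```python
import numpy as np

def phi(u):
    """Standard normal density."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def sel_kernel(u, alpha, sigma):
    """Assumed selection kernel (two-Gaussian combination)."""
    return (1.0 + alpha) * phi(u) - (alpha / sigma) * phi(u / sigma)

def local_icv(x, b, data, w, alpha, sigma, grid):
    """Local criterion ICV(x, b); `grid` is an equally spaced integration grid."""
    data = np.asarray(data, dtype=float)
    n = data.size
    du = grid[1] - grid[0]
    # L-kernel estimate on the integration grid.
    f_b = sel_kernel((grid[:, None] - data[None, :]) / b, alpha, sigma).sum(axis=1) / (n * b)
    term1 = (phi((x - grid) / w) * f_b**2).sum() * du / w
    # Leave-one-out L-kernel estimates at the observations.
    k = sel_kernel((data[:, None] - data[None, :]) / b, alpha, sigma)
    f_loo = (k.sum(axis=1) - np.diag(k)) / ((n - 1) * b)
    term2 = 2.0 * (phi((x - data) / w) * f_loo).sum() / (n * w)
    return term1 - term2

def smallest_local_minimizer(b_grid, scores):
    """First interior grid point that is a local minimum; global argmin as fallback."""
    s = np.asarray(scores, dtype=float)
    for i in range(1, len(s) - 1):
        if s[i] < s[i - 1] and s[i] <= s[i + 1]:
            return b_grid[i]
    return b_grid[int(np.argmin(s))]

def local_bandwidth(x, data, w, alpha, sigma, b_grid, grid, C):
    """Gaussian-kernel bandwidth at x: C times the selected local b."""
    scores = [local_icv(x, b, data, w, alpha, sigma, grid) for b in b_grid]
    return C * smallest_local_minimizer(b_grid, scores)
```

As $w$ grows, the weight $\phi((x - u)/w)$ becomes flat and the criterion reduces to $\phi(0)/w$ times the global LSCV criterion for the $L$-kernel estimator, in line with the remark that a very large $w$ corresponds to global ICV.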

Local LSCV can be criticized on the grounds that, at any x, it promises to be 
even more unstable than global LSCV since it (effectively) uses only a fraction of the 
n observations. Because of its much greater stability, ICV seems to be a much more 
feasible method of local bandwidth selection than does LSCV. We provide evidence 
of this stability by example in Section 8. 



7 Simulation study 

The primary goal of our simulation study is to compare ICV with ordinary LSCV.
However, we will also include the Sheather-Jones plug-in method in the study. We
considered the four sample sizes $n = 100$, 250, 500 and 5000, and sampled from
each of the five densities listed in Section 4. For each combination of density and
sample size, 1000 replications were performed. Here we give only a synopsis of our
results. The reader is referred to Savchuk, Hart, and Sheather (2008) for a much
more detailed account of what we observed.

Let $\hat h_0$ denote the minimizer of $\mathrm{ISE}(h)$ for a Gaussian kernel estimator. For
each replication, we computed $\hat h_0$, $\hat h^*_{ICV}$, $\hat h_{UCV}$ and $\hat h_{SJPI}$. The definition of $\hat h^*_{ICV}$ is
$\min(\hat h_{ICV}, \hat h_{OS})$, where $\hat h_{OS}$ is the oversmoothed bandwidth of Terrell (1990). Since
$\hat h_{ICV}$ tends to be biased upwards, this is a convenient means of limiting the bias. In
all cases the parameters $\alpha$ and $\sigma$ in the selection kernel $L$ were chosen according to
model (11). For any random variable $Y$ defined in each replication of our simulation,
we denote the average of $Y$ over all replications (with $n$ and $f$ fixed) by $\hat E(Y)$. Our
main conclusions may be summarized as follows.

• The ratio $\hat E(\hat h^*_{ICV} - \hat E\hat h_0)^2 / \hat E(\hat h_{UCV} - \hat E\hat h_0)^2$ ranged between 0.04 and 0.70 in
the sixteen settings excluding the skewed bimodal density. For the skewed
bimodal, the ratio was 0.84, 1.27, 1.09, and 0.40 at the respective sample sizes
100, 250, 500 and 5000. The fact that this ratio was larger than 1 in two
cases was a result of ICV's bias, since the sample standard deviation of the
ICV bandwidth was smaller than that for the LSCV bandwidth in all twenty
settings.

• The ratio $\hat E\left(\mathrm{ISE}(\hat h^*_{ICV})/\mathrm{ISE}(\hat h_0)\right) / \hat E\left(\mathrm{ISE}(\hat h_{UCV})/\mathrm{ISE}(\hat h_0)\right)$ was smaller than
1 for every combination of density and sample size. For the two "large bias"
cases mentioned in the previous remark the ratio was 0.92.

• The ratio $\hat E\left(\mathrm{ISE}(\hat h^*_{ICV})/\mathrm{ISE}(\hat h_0)\right) / \hat E\left(\mathrm{ISE}(\hat h_{SJPI})/\mathrm{ISE}(\hat h_0)\right)$ was smaller than
1 in six of the twenty cases considered. Among the other fourteen cases, the
ratio was between 1.00 and 1.15, exceeding 1.07 just twice.

• Despite the fact that the LSCV bandwidth is asymptotically normally
distributed (see Hall and Marron (1987)), its distribution in finite samples tends
to be skewed to the left. In contrast, our simulations show that the ICV
bandwidth distribution is nearly symmetric.
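
The capped bandwidth $\min(\hat h_{ICV}, \hat h_{OS})$ used in this study can be sketched as follows. The text does not reproduce the formula for the oversmoothed bandwidth, so the code assumes the standard form of Terrell's rule, $3\,[R(K)/(35\mu_{2K}^2)]^{1/5} s\, n^{-1/5}$, specialized to the Gaussian kernel and with the sample standard deviation $s$ as the scale estimate; the function names are hypothetical.

```python
import numpy as np

def oversmoothed_bandwidth(data):
    """Assumed Terrell-type oversmoothed bandwidth for a Gaussian kernel."""
    data = np.asarray(data, dtype=float)
    n = data.size
    r_phi = 1.0 / (2.0 * np.sqrt(np.pi))     # R(phi); mu_{2 phi} = 1
    scale = data.std(ddof=1)                  # sample standard deviation (assumption)
    return 3.0 * (r_phi / 35.0) ** 0.2 * scale * n ** (-0.2)

def capped_icv_bandwidth(h_icv, data):
    """Return min(h_ICV, h_OS), limiting the upward bias of the ICV bandwidth."""
    return min(h_icv, oversmoothed_bandwidth(data))
```

For the Gaussian kernel the leading constant is about 1.144, so the cap only binds when the ICV bandwidth exceeds roughly $1.144\, s\, n^{-1/5}$.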



8 Examples 

In this section we illustrate the use of ICV with two examples, one involving credit
scores from Fannie Mae and the other simulated data. The first example is
provided to compare the ICV, LSCV, and Sheather-Jones plug-in methods for choosing
a global bandwidth. The second example illustrates the benefit of applying ICV
locally.

8.1 Mortgage defaulters

In this example we analyze the credit scores of Fannie Mae clients who defaulted
on their loans. The mortgages considered were purchased in "bulk" lots by
Fannie Mae from primary banking institutions. The data set was taken from the website
http://www.dataminingbook.com associated with Shmueli, Patel, and Bruce (2006).

In Figure 3 we have plotted an unsmoothed frequency histogram and the LSCV,
ICV and Sheather-Jones plug-in density estimates for the credit scores. The class
interval size in the unsmoothed histogram was chosen to be 1, which is equal to
the accuracy to which the data have been reported. It turns out that the LSCV
curve tends to $-\infty$ when $h \to 0$, but has a local minimum at about 2.84. Using
$h = 2.84$ results in a severely undersmoothed estimate. Both the Sheather-Jones
plug-in and ICV density estimates show a single mode around 675 and look similar,
with the ICV estimate being somewhat smoother. Interestingly, a high percentage
of the defaulters have credit scores less than 620, which many lenders consider the
minimum score that qualifies for a loan; see Desmond (2008).



8.2 Local ICV: simulated example 

For this example we took five samples of size $n = 1500$ from the kurtotic unimodal
density defined in Marron and Wand (1992). First, we note that even the bandwidth
that minimizes $\mathrm{ISE}(h)$ results in a density estimate that is much too wiggly in the
tails. On the other hand, using local versions of either ICV or LSCV resulted in
much better density estimates, with local ICV producing in each case a visually
better estimate than that produced by local LSCV.

Figure 3: Unsmoothed histogram and kernel density estimates for credit scores.
Panels: unsmoothed frequency histogram, LSCV density estimate, ICV density estimate,
and SJPI density estimate; the horizontal axis shows credit scores.

For the local LSCV and ICV methods we considered four values of $w$ ranging
from 0.05 to 0.3. A selection kernel with $\alpha = 6$ and $\sigma = 6$ was used in local ICV.
This $(\alpha, \sigma)$ choice performs well for global bandwidth selection when the density is
unimodal, and hence seems reasonable for local bandwidth selection since locally
the density should have relatively few features. For a given $w$, the local ICV and
LSCV bandwidths were found for $x = -3, -2.9, \ldots, 2.9, 3$, and were interpolated at
other $x \in [-3, 3]$ using a spline. Average squared error (ASE) was used to measure
closeness of a local density estimate $\hat f_\ell$ to the true density $f$:

$$\mathrm{ASE} = \frac{1}{61} \sum_{i=1}^{61} \left(\hat f_\ell(x_i) - f(x_i)\right)^2,$$

where $x_i = -3 + 0.1(i - 1)$, $i = 1, \ldots, 61$, are the grid points above.

Figure 4 shows results for one of the five samples. Estimates corresponding to the
smallest and the largest values of $w$ are provided. The local ICV method performed
similarly well for all values of $w$ considered, whereas all the local LSCV estimates
were very unsmooth, albeit with some improvement in smoothness as $w$ increased.

9 Summary 

A widely held view is that kernel choice is not terribly important when it comes to
estimation of the underlying curve. In this paper we have shown that kernel choice
can have a dramatic effect on the properties of cross-validation. Cross-validating
kernel estimates that use Gaussian or other traditional kernels results in highly
variable bandwidths, a result that has been well-known since at least 1987. We
have shown that certain kernels with low efficiency for estimating $f$ can produce
cross-validation bandwidths whose relative error converges to 0 at a faster rate than
that of Gaussian-kernel cross-validation bandwidths.

The kernels we have studied have the form $(1 + \alpha)\phi(u) - \alpha\phi(u/\sigma)/\sigma$, where $\phi$
is the standard normal density and $\alpha$ and $\sigma$ are positive constants. The interesting
selection kernels in this class are of two types: unimodal, negative-tailed kernels
and "cut-out-the-middle" kernels, i.e., bimodal kernels that go negative between the
modes. Both types of kernels yield the rate improvement mentioned in the previous
paragraph. However, the best negative-tailed kernels yield bandwidths with smaller
asymptotic mean squared error than do the best "cut-out-the-middle" kernels.

A model for choosing the selection kernel parameters has been developed. Use
of this model makes our method completely automatic. A simulation study and
examples reveal that use of this method leads to improved performance relative to
ordinary LSCV.

To date we have considered only selection kernels that are a linear combination of
two normal densities. It is entirely possible that another class of kernels would work
even better. In particular, a question of at least theoretical interest is whether or
not the convergence rate of $n^{-1/4}$ for the relative bandwidth error can be improved
upon.



10 Appendix 



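Here we outline the proof of our theorem in Section 3. A much more detailed proof
is available from the authors.

We start by writing

$$\begin{aligned}
T_n(b_0) &= T_n(\hat b_{UCV}) + (b_0 - \hat b_{UCV})\, T_n^{(1)}(b_0) + \tfrac{1}{2}(b_0 - \hat b_{UCV})^2\, T_n^{(2)}(\tilde b) \\
&= -nR(L)/2 + (b_0 - \hat b_{UCV})\, T_n^{(1)}(b_0) + \tfrac{1}{2}(b_0 - \hat b_{UCV})^2\, T_n^{(2)}(\tilde b),
\end{aligned}$$

where $\tilde b$ is between $b_0$ and $\hat b_{UCV}$, and so

$$(\hat b_{UCV} - b_0)\left(1 - (\hat b_{UCV} - b_0)\,\frac{T_n^{(2)}(\tilde b)}{2T_n^{(1)}(b_0)}\right) = \frac{T_n(b_0) + nR(L)/2}{-T_n^{(1)}(b_0)}.$$

Using condition (5) we may write the last equation as

$$\hat b_{UCV} - b_0 = \frac{T_n(b_0) + nR(L)/2}{-T_n^{(1)}(b_0)} + o_p\left(\frac{T_n(b_0) + nR(L)/2}{-T_n^{(1)}(b_0)}\right). \tag{13}$$

Defining $s_n^2 = \mathrm{Var}(T_n(b_0))$ and $\beta_n = E(T_n(b_0)) + nR(L)/2$, we have

$$\frac{T_n(b_0) + nR(L)/2}{-T_n^{(1)}(b_0)} = \frac{T_n(b_0) - ET_n(b_0)}{s_n} \cdot \frac{s_n}{-T_n^{(1)}(b_0)} + \frac{\beta_n}{-T_n^{(1)}(b_0)}.$$

Using the central limit theorem of Hall (1984), it can be verified that

$$Z_n \equiv \frac{T_n(b_0) - ET_n(b_0)}{s_n} \xrightarrow{D} N(0, 1).$$

Computation of the first two moments of $T_n^{(1)}(b_0)$ reveals that

$$\frac{-T_n^{(1)}(b_0)}{5R(f'')\, b_0^4\, \mu_{2L}^2\, n^2/2} \xrightarrow{p} 1,$$

and so

$$\frac{T_n(b_0) + nR(L)/2}{-T_n^{(1)}(b_0)} = Z_n\, \frac{2s_n}{5R(f'')\, b_0^4\, \mu_{2L}^2\, n^2} + \frac{2\beta_n}{5R(f'')\, b_0^4\, \mu_{2L}^2\, n^2} + o_p\left(\frac{s_n + \beta_n}{b_0^4\, \mu_{2L}^2\, n^2}\right).$$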

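At this point we need the first two moments of $T_n(b_0)$. A fact that will be
used frequently from this point on is that $\mu_{2k,L} = O(\sigma^{2k})$, $k = 1, 2, \ldots$. Using our
assumptions on the smoothness of $f$, Taylor series expansions, symmetry of $\gamma$ about
0 and $\mu_{2\gamma} = 0$,

$$ET_n(b_0) = -\frac{n^2}{12}\, b_0^5\, \mu_{4\gamma} R(f'') + \frac{n^2}{240}\, b_0^7\, \mu_{6\gamma} R(f''') + O(n^2 b_0^9 \sigma^8).$$

Recalling the definition of $b_n$ from (10), we have

$$\beta_n = -\frac{n^2}{12}\, b_0^5\, \mu_{4\gamma} R(f'') + \frac{n^2}{240}\, b_0^7\, \mu_{6\gamma} R(f''') + \frac{n^2}{2}\, b_n^5\, \mu_{2L}^2 R(f'') + O(n^2 b_0^9 \sigma^8). \tag{14}$$

Let $\mathrm{MISE}_L(b)$ denote the MISE of an $L$-kernel estimator with bandwidth $b$. Then
$\mathrm{MISE}_L'(b_n) = (b_n - b_0)\,\mathrm{MISE}_L''(b_0) + o\left[(b_n - b_0)\,\mathrm{MISE}_L''(b_0)\right]$, implying that

$$b_n^5 = b_0^5 + 5b_0^4\, \frac{\mathrm{MISE}_L'(b_n)}{\mathrm{MISE}_L''(b_0)} + o\left[b_0^4\, \frac{\mathrm{MISE}_L'(b_n)}{\mathrm{MISE}_L''(b_0)}\right]. \tag{15}$$

Using a second order approximation to $\mathrm{MISE}_L'(b)$ and a first order approximation
to $\mathrm{MISE}_L''(b)$, we then have

$$b_n^5 = b_0^5 - b_0^7\, \frac{\mu_{2L}\mu_{4L} R(f''')}{4\mu_{2L}^2 R(f'')} + o(b_0^7 \sigma^2).$$

Substitution of this expression for $b_n$ into (14) and using the facts $\mu_{4\gamma} = 6\mu_{2L}^2$,
$\mu_{6\gamma} = 30\mu_{2L}\mu_{4L}$ and $b_0\sigma = o(1)$, it follows that $\beta_n = o(n^2 b_0^7 \sigma^6)$. Later in the proof
we will see that this last result implies that the first order bias of $\hat h_{ICV}$ is due only
to the difference $Cb_0 - h_0$.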

Tedious but straightforward calculations show that $s_n^2 \sim n^2 b_0 R(f) A_\alpha/2$, where
$A_\alpha$ is as defined in Section 3.1. It is worth noting that $A_\alpha = R(\rho_\alpha)$, where $\rho_\alpha(u) =
u\gamma_\alpha'(u)$ and $\gamma_\alpha(u) = (1 + \alpha)^2 \int \phi(u + v)\phi(v)\,dv - 2(1 + \alpha)\phi(u)$. One would expect
from Theorem 4.1 of Scott and Terrell (1987) that the factor $R(\rho)$ would appear in
$\mathrm{Var}(T_n(b_0))$. Indeed it does implicitly, since $R(\rho_\alpha) \sim R(\rho)$ as $\sigma \to \infty$. Our point is
that, when $\sigma \to \infty$, the part of $L$ depending on $\sigma$ is negligible in terms of its effect
on $R(\rho)$ and also $R(L)$.

To complete the proof write

$$\begin{aligned}
\frac{\hat h_{ICV} - h_0}{h_0} &= \frac{\hat h_{ICV} - h_0}{h_n} + o_p\left[\frac{\hat h_{ICV} - h_0}{h_n}\right] \\
&= \frac{\hat b_{UCV} - b_0}{b_n} + \frac{Cb_0 - h_0}{h_n} + o_p\left[\frac{\hat h_{ICV} - h_0}{h_n}\right].
\end{aligned}$$

Applying the same approximation of $b_0$ that led to (15), and the analogous one for
$h_0$, we have

$$\begin{aligned}
\frac{Cb_0 - h_0}{h_n} &= b_n^2\, \frac{\mu_{2L}\mu_{4L} R(f''')}{20\mu_{2L}^2 R(f'')} - h_n^2\, \frac{\mu_{2\phi}\mu_{4\phi} R(f''')}{20\mu_{2\phi}^2 R(f'')} + o(b_n^2\sigma^2 + h_n^2) \\
&= \frac{R(L)^{2/5}\, \mu_{2L}\mu_{4L} R(f''')}{20\,(\mu_{2L}^2)^{7/5} R(f'')^{7/5}}\, n^{-2/5} + o(b_n^2\sigma^2).
\end{aligned}$$

It is easily verified that, as $\sigma \to \infty$, $R(L) \sim (1 + \alpha)^2/(2\sqrt{\pi})$, $\mu_{2L} \sim -\alpha\sigma^2$ and
$\mu_{4L} \sim -3\alpha\sigma^4$, and hence

$$\frac{Cb_0 - h_0}{h_n} \sim \left(\frac{\sigma}{n}\right)^{2/5} \frac{R(f''')}{R(f'')^{7/5}}\, D_\alpha.$$

The proof is now complete upon combining all the previous results.

Figure 4: The solid curves correspond to the local LSCV and ICV density estimates,
whereas the dashed curves show the kurtotic unimodal density.